In preparing the dataset for analysis, we began by importing the raw tournament records from the specified source file into a working DataFrame. From this full set of events, we retained only men’s singles (MS) and women’s singles (WS) matches, since our focus is on head-to-head singles competition. The match date field was converted to a proper datetime format to enable accurate time-based feature construction and ordering. We then sorted the data chronologically so that all subsequent feature calculations respect the temporal sequence of matches, and reset the index to maintain a clean, continuous row numbering for downstream processing.
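The preparation steps above can be sketched in pandas as follows. This is a minimal illustration rather than the exact pipeline: the column names (`match_id`, `discipline`, `date`) and the inline sample records are assumptions standing in for the real source file.

```python
import io

import pandas as pd

# Hypothetical raw records; column names and values are illustrative only.
RAW_CSV = """match_id,discipline,date
3,MS,2021-03-14
1,WS,2020-01-07
2,XD,2020-06-21
4,MD,2021-09-02
"""


def prepare_matches(source) -> pd.DataFrame:
    """Load raw records, keep singles events, and order them chronologically."""
    df = pd.read_csv(source)
    # Retain only men's singles (MS) and women's singles (WS).
    df = df[df["discipline"].isin(["MS", "WS"])].copy()
    # Parse the date column so that sorting is by time, not by string.
    df["date"] = pd.to_datetime(df["date"])
    # A stable sort keeps the original order of matches played on the same day.
    df = df.sort_values("date", kind="stable").reset_index(drop=True)
    return df


matches = prepare_matches(io.StringIO(RAW_CSV))
print(matches)
```

In practice `source` would be the path to the tournament file; the in-memory string is used here only so the sketch runs on its own.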
4.2 Feature engineering
The predictive model relies on a set of engineered features designed to capture player skill, recent performance trends, head-to-head dynamics, and network-level influence. Each feature was constructed to reflect only information available at the time of prediction, thereby avoiding target leakage. The following describes each feature, its derivation, and its intended role in improving model performance.
For each match, the pre-match Elo ratings of both the focal player and their opponent were recorded. Elo ratings were initialized at 1200 for all players and updated online using a constant K=32 after each match. By recording these ratings before any update, the features represent each player’s latent skill level immediately prior to the match without incorporating outcome information. This choice captures the evolving competitive balance between players and is well suited to sports prediction contexts where performance changes over time.
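A minimal sketch of the record-then-update order is given below. The initial rating of 1200 and K=32 come from the text; the logistic expected-score formula with a 400-point scale is the conventional Elo form and is assumed here, as are the function names and the tuple layout of the input.

```python
from collections import defaultdict

INITIAL_ELO = 1200.0  # from the text
K = 32.0              # from the text


def expected_score(rating_a: float, rating_b: float) -> float:
    """Conventional Elo expectation for A against B (400-point scale assumed)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def add_pre_match_elo(matches):
    """matches: chronological (player, opponent, player_won) tuples.

    Returns one dict per match holding the ratings as they stood
    before the match; the update happens only after recording.
    """
    ratings = defaultdict(lambda: INITIAL_ELO)
    rows = []
    for player, opponent, player_won in matches:
        r_p, r_o = ratings[player], ratings[opponent]
        # Record first, so the features carry no outcome information.
        rows.append({
            "player": player,
            "opponent": opponent,
            "elo_player": r_p,
            "elo_opponent": r_o,
            "elo_diff": r_p - r_o,
        })
        score = 1.0 if player_won else 0.0
        exp_p = expected_score(r_p, r_o)
        ratings[player] = r_p + K * (score - exp_p)
        ratings[opponent] = r_o + K * ((1.0 - score) - (1.0 - exp_p))
    return rows


rows = add_pre_match_elo([("A", "B", True), ("A", "B", False)])
for row in rows:
    print(row)
```

In the two-match example, both players enter the first match at 1200; after A's win the second row records A at 1216 and B at 1184, since the expected score between equal ratings is 0.5.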
The Elo rating difference, computed as elo_player - elo_opponent, distills the relative strength of the two competitors into a single scalar measure. Positive values indicate that the focal player entered the match as the Elo favourite, while negative values suggest the opponent held an advantage. Using a difference metric rather than two separate ratings reduces collinearity and can improve interpretability for models sensitive to redundant inputs.
The score differential reflects the raw margin of points in a match, computed as the difference between the two sides’ total points. It is recorded before Elo updates to prevent leakage and is mirrored appropriately when generating the opponent’s perspective row. This metric adds context on the dominance or closeness of prior performances, potentially revealing patterns not visible from binary win/loss records alone.
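The mirroring step can be illustrated as follows. This is a sketch under assumed column names (`points_player`, `points_opponent`, `win`, and so on) with made-up scores; it shows only how the opponent's-perspective row is derived, with the sign of the differential flipped.

```python
import pandas as pd

# Hypothetical one-row-per-match table from the focal player's perspective.
matches = pd.DataFrame({
    "match_id": [1, 2],
    "player_id": ["A", "C"],
    "opponent_id": ["B", "A"],
    "points_player": [42, 35],
    "points_opponent": [30, 44],
    "win": [1, 0],
})
matches["score_diff"] = matches["points_player"] - matches["points_opponent"]

# Build the opponent's view: swap the paired columns, negate the margin,
# and invert the outcome.
mirrored = matches.rename(columns={
    "player_id": "opponent_id",
    "opponent_id": "player_id",
    "points_player": "points_opponent",
    "points_opponent": "points_player",
})
mirrored["score_diff"] = -mirrored["score_diff"]
mirrored["win"] = 1 - mirrored["win"]

# Two rows per match, one for each perspective.
long_df = (
    pd.concat([matches, mirrored], ignore_index=True)
    .sort_values("match_id", kind="stable")
    .reset_index(drop=True)
)
print(long_df)
```

Within each match the two `score_diff` values sum to zero, which is a quick consistency check on the mirrored table.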
4.4 Rolling Win %
Code
# ---------- Rolling win% (shifted) ----------
for w in (5, 10, 20):
    df[f"win_pct_{w}"] = (
        df.groupby("player_id")["win"]
          .transform(lambda s: s.shift(1).rolling(w, min_periods=1).mean())
    )
Three rolling win percentage features were calculated over the most recent 5, 10, and 20 matches, each shifted by one match to exclude the current outcome. Because min_periods=1, a window is averaged over however many prior matches are available, and a player’s first recorded match has no value. The use of multiple window sizes allows the model to detect both short-term momentum and longer-term consistency. Smaller windows respond quickly to form changes, while larger windows smooth short-term volatility and approximate a player’s baseline performance level.
An exponentially weighted moving average of prior head-to-head results was computed for each player–opponent pair, with a decay parameter α=0.1. This measure emphasizes recent encounters while retaining older results at diminishing weight. It captures stylistic or matchup-specific dynamics that are not fully explained by overall skill ratings, reflecting the idea that certain players may consistently perform better or worse against particular opponents.
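One way to compute such a feature without leakage is sketched below. Only α=0.1 is taken from the text; seeding the average with the first observed result, returning `None` when a pair has no prior meeting, and the function name are assumptions made for illustration.

```python
ALPHA = 0.1  # decay parameter from the text


def h2h_decay_features(matches, alpha=ALPHA):
    """matches: chronological (player, opponent, win) tuples, win in {0, 1}.

    Returns, for each row, the exponentially weighted average of the
    player's earlier results against that opponent, or None if the
    pair has not met before.
    """
    state = {}     # (player, opponent) -> running average
    features = []
    for player, opponent, win in matches:
        key = (player, opponent)
        prev = state.get(key)
        # Emit the value as it stood before this match.
        features.append(prev)
        # Then fold the new result in (seeded with the first result).
        if prev is None:
            state[key] = float(win)
        else:
            state[key] = alpha * win + (1.0 - alpha) * prev
    return features


history = [("A", "B", 1), ("A", "B", 0), ("A", "C", 1), ("A", "B", 1)]
h2h = h2h_decay_features(history)
print(h2h)
```

In this example the third meeting of A and B sees a value of 0.9: the earlier win is seeded as 1.0, and the subsequent loss pulls it down by a factor of (1 - α).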
The head-to-head decay feature was further adjusted by the relative strength of the opponent, computed as the ratio of the opponent’s Elo to the player’s Elo. This scaling accounts for the fact that beating a strong opponent is more informative than beating a weaker one. The adjustment helps to prevent misleadingly high head-to-head scores when the wins were accumulated against underperforming or low-ranked opponents.
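As a small worked example, the adjustment can be read as a multiplication of the decayed head-to-head value by the Elo ratio. The multiplicative form and the function name are assumptions; the text specifies only the ratio itself.

```python
def strength_adjusted_h2h(h2h_decay: float, elo_player: float, elo_opponent: float) -> float:
    """Scale the decayed head-to-head value by opponent Elo / player Elo.

    Multiplicative scaling is assumed here for illustration.
    """
    return h2h_decay * (elo_opponent / elo_player)


# The same head-to-head record counts for more against a stronger opponent.
vs_stronger = strength_adjusted_h2h(0.8, elo_player=1200.0, elo_opponent=1500.0)
vs_weaker = strength_adjusted_h2h(0.8, elo_player=1500.0, elo_opponent=1200.0)
print(vs_stronger, vs_weaker)
```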
PageRank scores were calculated on a directed graph constructed from matches in the training period only, with edges directed from losers to winners. This network-based metric rewards victories over players who themselves have many quality wins, thereby encoding strength-of-schedule information. By computing PageRank solely on training data, the process avoids incorporating future results into the feature set.
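The construction can be sketched with a dependency-free power iteration, shown below. The damping factor of 0.85, the fixed iteration count, the uniform redistribution of rank from players with no losses, and the sample matches and cutoff date are all assumptions; a graph library would normally be used in practice.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Plain power-iteration PageRank over directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for node in nodes:
            targets = out_links[node]
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical (date, winner, loser) records and training cutoff.
all_matches = [
    ("2020-01-10", "A", "B"),
    ("2020-02-03", "A", "C"),
    ("2020-03-15", "B", "C"),
    ("2021-06-01", "C", "A"),  # after the cutoff, so excluded
]
TRAIN_END = "2020-12-31"

# Edges point from loser to winner and use training-period matches only.
train_edges = [
    (loser, winner) for date, winner, loser in all_matches if date <= TRAIN_END
]
scores = pagerank(train_edges)
print(scores)
```

Here A, who beat both B and C in the training period, receives the highest score, followed by B and then C; the later upset of A by C does not enter the graph.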